Looking Beyond Label Noise: Shifted Label Distribution Matters in Distantly Supervised Relation Extraction
In recent years, there has been a surge of interest in applying distant supervision (DS) to automatically generate training data for relation extraction (RE). In this paper, we study what limits the performance of DS-trained neural models, conduct thorough analyses, and identify a factor that can greatly influence performance: shifted label distribution. Specifically, we find that this problem commonly exists in real-world DS datasets and that, without special handling, typical DS-RE models cannot automatically adapt to this shift and thus suffer degraded performance. To further validate our intuition, we develop a simple yet effective adaptation method for DS-trained models, bias adjustment, which updates a model learned on the source domain (i.e., the DS training set) with a label distribution estimated on the target domain (i.e., the test set). Experiments demonstrate that bias adjustment achieves consistent performance gains for DS-trained models, especially neural models, with up to a 23% relative F1 improvement, which verifies our assumptions. Our code and data can be found at https://github.com/INK-USC/shifted-label-distribution.

Comment: 13 pages (10 pages paper, 3 pages appendix). Appears at EMNLP 2019.
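Read as a standard label-shift correction, bias adjustment can be sketched as rescaling the model's output distribution by the ratio of an estimated target-domain label prior to the source (DS training set) prior. Below is a minimal NumPy sketch under that reading, assuming access to the model's logits and a target-prior estimate; the paper's exact formulation and prior-estimation procedure may differ (see the repository above).

    import numpy as np

    def bias_adjust(logits, source_prior, target_prior, eps=1e-12):
        """Correct a classifier's outputs for a shifted label distribution.

        logits:        (n_examples, n_labels) raw scores from the DS-trained model
        source_prior:  (n_labels,) label frequencies in the DS training set
        target_prior:  (n_labels,) label frequencies estimated on the target set
        """
        source_prior = np.asarray(source_prior, dtype=float)
        target_prior = np.asarray(target_prior, dtype=float)
        # Multiplying p(y|x) by target_prior / source_prior and renormalizing
        # is equivalent to adding the log-ratio of the priors in logit space.
        adjusted = logits + np.log(target_prior + eps) - np.log(source_prior + eps)
        exp = np.exp(adjusted - adjusted.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)  # adjusted p(y|x)

Note that this adjustment needs no retraining: only the output layer's scores are shifted, which is what makes it cheap to apply on top of an already-trained model.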
Prompt Engineering a Prompt Engineer
Prompt engineering is a challenging yet crucial task for optimizing the
performance of large language models (LLMs). It requires complex reasoning to
examine the model's errors, hypothesize what is missing or misleading in the
current prompt, and communicate the task with clarity. While recent works
indicate that LLMs can be meta-prompted to perform automatic prompt
engineering, their potential may not be fully realized due to the lack of sufficient guidance in the meta-prompt to elicit complex reasoning capabilities in LLMs. In this work, we investigate the problem of "prompt engineering a
prompt engineer" -- constructing a meta-prompt that more effectively guides
LLMs to perform automatic prompt engineering. We introduce and analyze key
components, such as a step-by-step reasoning template and context
specification, which lead to improved performance. In addition, inspired by
common optimization concepts such as batch size, step size and momentum, we
introduce their verbalized counterparts to the meta-prompt and investigate
their effects. Our final method, named PE2, finds a prompt that outperforms
"let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the
GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction
Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world
industrial prompt. In these settings, PE2 achieves strong performance and
outperforms prior automatic prompt engineering baselines. Further, we show that
PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete
prompts, and demonstrates non-trivial counterfactual reasoning abilities.
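As a concrete illustration of what such a meta-prompt might contain, the sketch below assembles the kinds of components the abstract names: a context specification, a batch of failure cases (a verbalized analogue of batch size), and a step-by-step reasoning template. The function and template wording are hypothetical; PE2's actual meta-prompts are given in the paper.

    def build_meta_prompt(task_description, current_prompt, failures, batch_size=3):
        """Assemble a meta-prompt asking an LLM to revise `current_prompt`.

        failures:   list of (input, model_output, expected) triples
        batch_size: number of failure cases shown, the verbalized analogue
                    of an optimizer's batch size
        """
        cases = "\n\n".join(
            f"Input: {x}\nModel output: {y}\nExpected: {g}"
            for x, y, g in failures[:batch_size]
        )
        return (
            # Context specification: state how the prompt is used downstream.
            "A prompt is prepended to every input of this task before it is "
            "sent to a language model.\n"
            f"Task: {task_description}\n"
            f"Current prompt: {current_prompt}\n\n"
            f"Failure cases under the current prompt:\n{cases}\n\n"
            # Step-by-step reasoning template: examine, hypothesize, then edit.
            "Step 1: For each failure case, explain why the model's output is wrong.\n"
            "Step 2: Hypothesize what is missing or misleading in the current prompt.\n"
            "Step 3: Based on Steps 1-2, write an improved prompt."
        )

The returned string would be sent to the meta-prompted LLM, whose Step 3 output becomes the candidate prompt for the next optimization iteration.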
How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
We investigate the predictability of large language model (LLM) capabilities:
given records of past experiments using different model families, numbers of
parameters, tasks, and numbers of in-context examples, can we accurately
predict LLM performance on new experiment configurations? Answering this
question has practical implications for LLM users (e.g., deciding which models
to try), developers (e.g., prioritizing evaluation on representative tasks),
and the research community (e.g., identifying hard-to-predict capabilities that
warrant further investigation).
We study the performance prediction problem on experiment records from
BIG-bench. On a random train-test split, an MLP-based predictor achieves RMSE
below 5%, demonstrating the presence of learnable patterns within the
experiment records. Further, we formulate the problem of searching for
"small-bench," an informative subset of BIG-bench tasks from which the
performance of the full set can be maximally recovered, and we find a subset that is as informative for evaluating new model families as BIG-bench Hard while being 3x smaller.
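A minimal sketch of such a performance predictor is below, assuming tabular experiment records with categorical features (model family, task) and numeric ones (log parameter count, number of in-context examples). The synthetic data, feature encoding, and MLP architecture are placeholder assumptions, not the paper's setup.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical experiment records; the real ones come from BIG-bench runs.
    rng = np.random.default_rng(0)
    records = pd.DataFrame({
        "model_family": rng.choice(["family_a", "family_b", "family_c"], size=500),
        "log_params": rng.uniform(7, 11, size=500),       # log10 parameter count
        "task": rng.choice([f"task_{i}" for i in range(40)], size=500),
        "n_shots": rng.choice([0, 1, 3], size=500),
        "normalized_score": rng.uniform(0, 1, size=500),  # placeholder targets
    })

    X = records.drop(columns="normalized_score")
    y = records["normalized_score"]

    predictor = Pipeline([
        ("encode", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"),
              ["model_family", "task"])],
            remainder="passthrough")),          # numeric features pass through
        ("mlp", MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=1000)),
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    predictor.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, predictor.predict(X_te)))
    print(f"held-out RMSE: {rmse:.3f}")

On real experiment records, an RMSE below 5% on such a random split is what indicates the learnable patterns the abstract describes; on the synthetic placeholders above the number is of course meaningless.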
Estimating Large Language Model Capabilities without Labeled Test Data
Large Language Models (LLMs) have exhibited an impressive ability to perform
in-context learning (ICL) from only a few examples, but the success of ICL
varies widely from task to task. Thus, it is important to quickly determine
whether ICL is applicable to a new task, but directly evaluating ICL accuracy
can be costly in situations where test data is expensive to annotate -- the
exact situations where ICL is most appealing. In this paper, we propose the
task of ICL accuracy estimation, in which we predict the accuracy of an LLM
when doing in-context learning on a new task given only unlabeled data for that
task. To perform ICL accuracy estimation, we propose a method that trains a
meta-model using LLM confidence scores as features. We compare our method to
several strong accuracy estimation baselines on a new benchmark that covers 4
LLMs and 3 task collections. On average, the meta-model improves over all
baselines and achieves the same estimation performance as directly evaluating
on 40 labeled test examples per task, across the total 12 settings. We
encourage future work to improve on our methods and evaluate on our ICL
accuracy estimation benchmark to deepen our understanding of when ICL works.

Comment: 14 pages, 4 figures.
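One way to picture the meta-model: summarize the LLM's confidence scores on a task's unlabeled examples into a fixed-length feature vector, then regress true ICL accuracy on those features over tasks where labels do exist. The feature summary and choice of regressor below are illustrative assumptions, not the paper's exact design.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def confidence_features(confidences):
        """Summarize a task's unlabeled-set confidence scores as a fixed vector."""
        c = np.asarray(confidences, dtype=float)
        return np.concatenate([
            [c.mean(), c.std()],
            np.quantile(c, [0.1, 0.25, 0.5, 0.75, 0.9]),
        ])

    def fit_meta_model(train_conf, train_acc):
        """Train on tasks with known ICL accuracy.

        train_conf: list of per-task confidence-score arrays
        train_acc:  the corresponding measured ICL accuracies
        """
        X = np.stack([confidence_features(c) for c in train_conf])
        return GradientBoostingRegressor().fit(X, train_acc)

    def estimate_accuracy(meta_model, new_task_conf):
        """Estimate ICL accuracy on a new task from unlabeled data alone."""
        return meta_model.predict(confidence_features(new_task_conf)[None, :])[0]

The appeal of this setup is that the expensive labeled evaluation happens once, on the meta-training tasks; estimating accuracy on each new task then requires only running the LLM on unlabeled examples.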
LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation
Successfully training a deep neural network demands a huge corpus of labeled
data. However, each label only provides limited information to learn from and
collecting the requisite number of labels involves massive human effort. In
this work, we introduce LEAN-LIFE, a web-based, Label-Efficient AnnotatioN
framework for sequence labeling and classification tasks, with an easy-to-use
UI that not only allows an annotator to provide the needed labels for a task,
but also enables LearnIng From Explanations for each labeling decision. Such
explanations enable us to generate useful additional labeled data from
unlabeled instances, bolstering the pool of available training data. On three
popular NLP tasks (named entity recognition, relation extraction, sentiment
analysis), we find that using this enhanced supervision allows our models to
surpass competitive baseline F1 scores by more than 5-10 percentage points, while using 2x fewer labeled instances. Our framework is the first to
utilize this enhanced supervision technique and does so for three important
tasks -- thus providing improved annotation recommendations to users and an
ability to build datasets of (data, label, explanation) triples instead of the
regular (data, label) pairs.

Comment: Accepted to ACL 2020 (demo track). The first two authors contributed equally. Project page: http://inklab.usc.edu/leanlife
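As an illustration of the enriched dataset format, here is a hypothetical record type for the (data, label, explanation) triples such a framework could export; the field names are invented for this sketch, not LEAN-LIFE's actual schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AnnotatedExample:
        text: str          # raw instance, e.g. a sentence for NER, RE, or sentiment
        label: str         # the annotator's labeling decision
        explanation: str   # natural-language rationale for that decision

    dataset: List[AnnotatedExample] = [
        AnnotatedExample(
            text="The acting was flat, but the soundtrack was wonderful.",
            label="positive",
            explanation="The word 'wonderful' appears near 'soundtrack'.",
        ),
    ]

The explanation field is what allows additional labeled data to be generated from unlabeled instances, since a rationale like the one above can be matched against new sentences that the bare label could not cover.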